Project Overview

Total Students Analyzed: 4,424
Dataset Dropout Rate: 32.9%
U.S. National Dropout Rate: 39%
NY Non-Completers Growth: +2.3% (1.9M individuals)

Background

While high school dropout rates are declining, 32.9%[1] of U.S. college students still fail to graduate. For education leaders, the priority is to close this gap by identifying the one-in-four students currently at risk of dropping out before they lose momentum.

In New York alone, there are over 1.9 million individuals with some college but no credential, a population that grew by 2.3%[2] in the past year.

Our Mission

This app aims to reduce academic failure by using machine learning to identify at-risk students at an early stage. By providing actionable insights, we enable educators to implement timely support strategies that keep students on the path to success.

Framework & Solutions

We provide a systematic approach to student success across three dimensions:

Student-Level: Individual risk profiles for personalized intervention

Cohort-Level: Trend analysis for specific student demographics

System-Level: Strategic dashboards for school and district leadership

Data Reference

This project uses a dataset from the UCI Machine Learning Repository to illustrate these real-world challenges. The dataset was originally created to help reduce academic attrition in higher education, and it lets us demonstrate how machine learning can flag at-risk students early in their academic journey.

Created by: Qi Zhao (zhaoq23009@gmail.com)

Last updated: February 9, 2026

Student-Level Actions - Personalized Interventions

🔴 Critical Priority Students (Top 10)

Student ID | Priority | Risk Score | Top Risk Factor | Recommended Action
#1384 | Critical | 99.88% | Tuition fees up to date | Immediate intervention recommended: contact within 48 hours
#2454 | Critical | 99.87% | Tuition fees up to date | Immediate intervention recommended: contact within 48 hours
#3883 | Critical | 99.79% | Tuition fees up to date | Immediate intervention recommended: contact within 48 hours
#1403 | Critical | 99.79% | Tuition fees up to date | Immediate intervention recommended: contact within 48 hours
#4191 | Critical | 99.75% | Tuition fees up to date | Immediate intervention recommended: contact within 48 hours
#888 | Critical | 99.75% | CR_Sem2 | Immediate intervention recommended: contact within 48 hours
#3476 | Critical | 99.75% | CR_Sem2 | Immediate intervention recommended: contact within 48 hours
#3961 | Critical | 99.75% | CR_Sem2 | Immediate intervention recommended: contact within 48 hours
#3987 | Critical | 99.74% | Tuition fees up to date | Immediate intervention recommended: contact within 48 hours
#2455 | Critical | 99.72% | CR_Sem2 | Immediate intervention recommended: contact within 48 hours

🟡 Warning Priority Students (Top 10)

Student ID | Priority | Risk Score | Top Risk Factor | Recommended Action
#1727 | Warning | 69.66% | Curricular units 2nd sem (approved) | Proactive monitoring recommended: monthly check-ins
#4174 | Warning | 69.04% | CR_Sem2 | Proactive monitoring recommended: monthly check-ins
#912 | Warning | 68.69% | CR_Sem1 | Proactive monitoring recommended: monthly check-ins
#1505 | Warning | 68.60% | Tuition fees up to date | Proactive monitoring recommended: monthly check-ins
#3829 | Warning | 68.36% | CR_Sem2 | Proactive monitoring recommended: monthly check-ins
#2842 | Warning | 68.22% | Tuition fees up to date | Proactive monitoring recommended: monthly check-ins
#283 | Warning | 67.53% | CR_Sem2 | Proactive monitoring recommended: monthly check-ins
#2694 | Warning | 67.27% | CR_Sem2 | Proactive monitoring recommended: monthly check-ins
#152 | Warning | 67.17% | Course | Proactive monitoring recommended: monthly check-ins
#3496 | Warning | 66.70% | Tuition fees up to date | Proactive monitoring recommended: monthly check-ins

🟢 On Track Students (Top 10)

Student ID | Priority | Risk Score | Top Risk Factor | Recommended Action
#1686 | On Track | 39.88% | CR_Sem1 | On track: maintain current support level
#207 | On Track | 39.80% | Mother's occupation | On track: maintain current support level
#243 | On Track | 39.66% | CR_Sem1 | On track: maintain current support level
#1605 | On Track | 39.42% | CR_Sem2 | On track: maintain current support level
#2670 | On Track | 39.11% | CR_Sem2 | On track: maintain current support level
#3750 | On Track | 38.99% | Course | On track: maintain current support level
#1416 | On Track | 38.89% | Course | On track: maintain current support level
#759 | On Track | 38.70% | Curricular units 2nd sem (approved) | On track: maintain current support level
#2247 | On Track | 38.62% | Delta_CR | On track: maintain current support level
#1961 | On Track | 38.51% | Tuition fees up to date | On track: maintain current support level

What-If Analysis - Intervention Simulation

Sample Student: #1384 - Baseline Dropout Risk: 99.88%

Scenario | Description | New Risk | Risk Reduction (percentage points) | Reduction %
Baseline | No intervention | 99.88% | - | -
Academic Support | Increase 2nd semester grade by +10 and evaluations by +5 | 99.86% | 0.02 | 0.0%
Financial Support | Set tuition fees up to date and clear debtor status | 99.72% | 0.16 | 0.2%
Socio-emotional Support | Increase stress/support-related features by 20% (proxy) | 99.89% | -0.01 | -0.0%
Key Finding: Financial Support is the most effective of the three simulated interventions, but it lowers this student's dropout risk by only 0.16 percentage points (from 99.88% to 99.72%).
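
The two reduction columns can be reproduced with simple arithmetic. The sketch below assumes that Risk Reduction is the absolute difference in percentage points and that Reduction % is that difference relative to the baseline. The function name is illustrative and not taken from the app.

```python
def risk_reduction(baseline_pct, new_pct):
    """Return (absolute reduction in percentage points, relative reduction in %)."""
    absolute = baseline_pct - new_pct
    relative = 100.0 * absolute / baseline_pct
    return absolute, relative


# Scenario values for student #1384, copied from the what-if table.
baseline = 99.88
scenarios = {
    "Academic Support": 99.86,
    "Financial Support": 99.72,
    "Socio-emotional Support": 99.89,
}

results = {name: risk_reduction(baseline, new) for name, new in scenarios.items()}
# The "most effective" scenario is the one with the largest absolute reduction.
best = max(results, key=lambda name: results[name][0])

for name, (absolute, relative) in results.items():
    print(f"{name}: {absolute:+.2f} pp ({relative:+.1f}% of baseline)")
print("Largest reduction:", best)
```

Because the baseline is close to 100%, the absolute and relative reductions are nearly identical for this student.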

Cohort-Level Actions - Resource Allocation

Course Risk Analysis

Courses ranked by average dropout risk of enrolled students.

Rank | Course Code | Course Name | Students | Avg Risk | Priority | Recommended Action
#1 | 33 | Biofuel Production Technologies | 2 | 85.69% | Critical | Immediate intervention: assign dedicated advisor and increase TA support
#2 | 9119 | Informatics Engineering | 37 | 52.75% | High | Increase TA support and monitor weekly
#3 | 9991 | Management (evening) | 53 | 52.42% | High | Increase TA support and monitor weekly
#4 | 9003 | Agronomy | 39 | 47.49% | High | Increase TA support and monitor weekly
#5 | 9130 | Equinculture | 30 | 41.50% | High | Increase TA support and monitor weekly
#6 | 9853 | Basic Education | 37 | 40.07% | High | Increase TA support and monitor weekly
#7 | 9773 | Journalism and Communication | 70 | 37.34% | Medium | Monitor closely and provide supplemental resources
#8 | 8014 | Social Service (evening) | 50 | 37.26% | Medium | Monitor closely and provide supplemental resources
#9 | 171 | Animation and Multimedia Design | 45 | 36.99% | Medium | Monitor closely and provide supplemental resources
#10 | 9556 | Oral Hygiene | 15 | 35.38% | Medium | Monitor closely and provide supplemental resources
Methodology: Average dropout risk is the mean predicted dropout probability across all students enrolled in the course.
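
A minimal sketch of this ranking, assuming each student is represented as a (course code, predicted dropout probability) pair. The sample records are hypothetical and do not come from the dataset.

```python
from collections import defaultdict


def rank_courses_by_mean_risk(records):
    """Group (course_code, risk) pairs by course and rank by mean risk, highest first."""
    totals = defaultdict(lambda: [0.0, 0])  # course -> [sum of risks, student count]
    for course, risk in records:
        totals[course][0] += risk
        totals[course][1] += 1
    ranked = [(course, s / n, n) for course, (s, n) in totals.items()]
    ranked.sort(key=lambda row: row[1], reverse=True)
    return ranked


# Hypothetical predictions for two courses.
sample = [(33, 0.90), (33, 0.80), (9119, 0.50), (9119, 0.60), (9119, 0.40)]
ranking = rank_courses_by_mean_risk(sample)

for rank, (course, mean_risk, n_students) in enumerate(ranking, start=1):
    print(f"#{rank} course {course}: {mean_risk:.2%} over {n_students} students")
```

Returning the student count alongside the mean makes it easy to spot courses whose average rests on very few students.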

Cohort Comparison - Radar Chart Analysis

With Scholarship vs Without Scholarship (5 Dimensions)

Dimension With Scholarship Without Scholarship Difference
Academic Prep 0.38 0.36 +0.02
Current Success 0.47 0.37 +0.10
Engagement 0.26 0.26 0.00
Financial Stress 0.95 0.89 +0.06
Stability 0.22 0.29 -0.07
Interpretation: Students with a scholarship show stronger Current Success (Δ=+0.10).

Group 1 vs Group 0 (5 Dimensions)

Dimension Group 1 Group 0 Difference
Academic Prep 0.36 0.37 -0.01
Current Success 0.34 0.43 -0.09
Engagement 0.25 0.26 -0.01
Financial Stress 0.88 0.91 -0.03
Stability 0.31 0.25 +0.06
Interpretation: Students in Group 0 show stronger Current Success (Δ=-0.09).

Group 1 vs Group 0 (5 Dimensions)

Dimension Group 1 Group 0 Difference
Academic Prep 0.39 0.37 +0.02
Current Success 0.40 0.40 0.00
Engagement 0.26 0.26 0.00
Financial Stress 0.90 0.90 0.00
Stability 0.29 0.27 +0.02
Interpretation: Students in Group 1 show slightly stronger Academic Prep (Δ=+0.02).

Entity-Driven Feedback Matrix

Natural Language Processing Analysis: Student feedback surveys analyzed using entity extraction and sentiment classification to identify actionable improvement areas.

Quadrant Interpretation:
  • Success Cases (Top Right): High-frequency, positive topics - scale these practices
  • Critical Issues (Bottom Right): High-frequency, negative topics - immediate intervention required
  • Niche Strengths (Top Left): Low-frequency, positive topics - monitor and expand
  • Minor Frustrations (Bottom Left): Low-frequency, negative topics - low priority
Key Action: Prioritize "Math Homework Volume" and "Lab Report Guidelines" for immediate review, while expanding successful practices like "Group Discussions" to other courses.

System-Level Actions - Policy Interventions

Top 10 Global Risk Factors

Features with the highest mean absolute SHAP values across all students.

Rank | Feature | Mean |SHAP| | Policy Action
1 | CR_Sem2 | 0.5711 | Monitor the factor and evaluate targeted interventions
2 | Tuition fees up to date | 0.3105 | Establish emergency financial aid funding and improve payment flexibility
3 | Curricular units 2nd sem (approved) | 0.2479 | Strengthen academic support programs and targeted tutoring
4 | CR_Sem1 | 0.2198 | Monitor the factor and evaluate targeted interventions
5 | Course | 0.1703 | Monitor the factor and evaluate targeted interventions
6 | Unemployment rate | 0.1507 | Monitor the factor and evaluate targeted interventions
7 | Curricular units 2nd sem (grade) | 0.1319 | Strengthen academic support programs and targeted tutoring
8 | Gender | 0.1293 | Monitor the factor and evaluate targeted interventions
9 | Age at enrollment | 0.1255 | Monitor the factor and evaluate targeted interventions
10 | Admission grade | 0.1165 | Strengthen academic support programs and targeted tutoring
Interpretation: These features represent institution-wide patterns that require systemic interventions.
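
The global ranking is an aggregation of per-student SHAP values. The sketch below shows only that aggregation step (mean of absolute values per feature), using a tiny hypothetical SHAP matrix. How the app obtains the per-student SHAP values is not shown in this report and is not assumed here.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by the mean absolute SHAP value across all students.

    shap_rows: one list of SHAP values per student, aligned with feature_names.
    """
    n_students = len(shap_rows)
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        for j, value in enumerate(row):
            totals[j] += abs(value)  # sign is dropped: only magnitude matters
    scores = {name: total / n_students for name, total in zip(feature_names, totals)}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical SHAP values for two students and two features.
names = ["CR_Sem2", "Gender"]
rows = [[0.6, -0.2], [-0.4, 0.1]]
global_ranking = mean_abs_shap(rows, names)

for rank, (feature, score) in enumerate(global_ranking, start=1):
    print(rank, feature, round(score, 4))
```

Taking the absolute value first means a feature that pushes risk up for some students and down for others still ranks as influential.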

Longitudinal Monitoring of Student Climate

NLP-Powered Sentiment Analysis & Entity Attribution

Automated Monitoring System: Monthly sentiment scores derived from student communications (emails, forum posts, surveys) with spaCy-powered entity extraction to identify structural friction points.

Sentiment Timeline
Average Sentiment: 68.1 (Academic Year 2024-25)
Alert Triggers: 2 (Below Threshold of 50)
Top Entity (Dec): LMS Update (-15 points)
Top Entity (Apr): Regents Exam (-11 points)
December 2024 Alert: Sentiment dropped to 45 following LMS system update. Analysis of student communications revealed widespread login issues and lost assignment data. Action Taken: Emergency technical support hours and deadline extensions implemented.
April 2025 Alert: Sentiment declined to 40 during Regents Exam period. Entity analysis showed concerns about exam format changes and study resource availability. Recommendation: Improve exam preparation communication and expand review sessions.
Methodology: Sentiment scores are computed using transformer-based models (DistilBERT) on student text data. Named entity recognition (spaCy) extracts topics and events during negative spikes. The system auto-generates an alert when sentiment stays below the alert threshold (50) for two consecutive data points.
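
The alert rule can be sketched as a simple streak counter. This is a minimal illustration that assumes a fixed threshold of 50 and a streak length of two. The function name and the score series are hypothetical, and the DistilBERT scoring and spaCy extraction steps are not reproduced.

```python
def find_alerts(scores, threshold=50.0, consecutive=2):
    """Return the indices where an alert fires.

    An alert fires once per streak, at the point where the score has been
    below `threshold` for `consecutive` data points in a row.
    """
    alerts = []
    run = 0  # length of the current below-threshold streak
    for i, score in enumerate(scores):
        run = run + 1 if score < threshold else 0
        if run == consecutive:
            alerts.append(i)
    return alerts


# Hypothetical sentiment series (one value per data point).
series = [60, 48, 45, 70, 49, 55, 44, 40]
alert_points = find_alerts(series)
print(alert_points)
```

A single dip below the threshold (index 4 in the sample series) does not fire an alert, which keeps one noisy data point from triggering an intervention.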

Policy Simulation - Cost-Benefit Analysis

Baseline Metrics

Total Students: 885
High-Risk Students: 204
Baseline Dropout Rate: 27.2%

Intervention Scenarios

Financial Aid Program

Description: Clear debt for 44 high-risk students with debt (subset of 204 high-risk students)
Beneficiaries: 44 students
Total Cost: $88,000
Dropouts Prevented: 13
Cost per Student Saved: $6,769
New Dropout Rate: 25.76%

Engagement Nudge System

Description: Proxy boosts to engagement/achievement features (+30%) for all 204 high-risk students
Beneficiaries: 204 students
Total Cost: $500
Dropouts Prevented: 2
Cost per Student Saved: $250
New Dropout Rate: 27.01%
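Recommendation: Prioritize interventions with the lowest cost per student saved. On that measure the Engagement Nudge System ($250) ranks ahead of the Financial Aid Program ($6,769), although it prevents far fewer dropouts (2 vs. 13).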
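
The scenario figures follow from a few lines of arithmetic. The sketch below assumes a baseline of 241 predicted dropouts, which is 27.2% of 885 rounded to a whole student. That count is inferred here and is not stated in the report.

```python
def evaluate_scenario(total_students, baseline_dropouts, total_cost, dropouts_prevented):
    """Return cost per student saved and the dropout rate after the intervention."""
    return {
        "cost_per_student_saved": total_cost / dropouts_prevented,
        "new_dropout_rate_pct": 100.0 * (baseline_dropouts - dropouts_prevented) / total_students,
    }


TOTAL_STUDENTS = 885
BASELINE_DROPOUTS = 241  # assumption: round(0.272 * 885)

financial_aid = evaluate_scenario(TOTAL_STUDENTS, BASELINE_DROPOUTS, 88_000, 13)
nudge_system = evaluate_scenario(TOTAL_STUDENTS, BASELINE_DROPOUTS, 500, 2)

for name, result in [("Financial Aid Program", financial_aid),
                     ("Engagement Nudge System", nudge_system)]:
    print(f"{name}: ${result['cost_per_student_saved']:,.0f} per student saved, "
          f"new dropout rate {result['new_dropout_rate_pct']:.2f}%")
```

Under that assumption the sketch reproduces the reported values: $6,769 and 25.76% for financial aid, $250 and 27.01% for the nudge system.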

Intake & Quality Checks

Weekly intake → schema alignment → integrity audit → target balance & drift

Basic checks (Overall)

rows cols missing_cells missing_cell_pct duplicate_rows has_missing has_duplicates
4424 37 0 0.0 0 No No

Column type summary

dtype n_cols
Int64 29
float64 7
str 1

Details:

Int64 (29 columns): Marital status, Application mode, Application order, Course, Daytime/evening attendance, Previous qualification, Nacionality, Mother's qualification, Father's qualification, Mother's occupation, Father's occupation, Displaced, Educational special needs, Debtor, Tuition fees up to date, Gender, Scholarship holder, Age at enrollment, International, Curricular units 1st sem (credited), Curricular units 1st sem (enrolled), Curricular units 1st sem (evaluations), Curricular units 1st sem (approved), Curricular units 1st sem (without evaluations), Curricular units 2nd sem (credited), Curricular units 2nd sem (enrolled), Curricular units 2nd sem (evaluations), Curricular units 2nd sem (approved), Curricular units 2nd sem (without evaluations)
float64 (7 columns): Previous qualification (grade), Admission grade, Curricular units 1st sem (grade), Curricular units 2nd sem (grade), Unemployment rate, Inflation rate, GDP
str (1 column): Target

Outcome distribution

No data available.

Data Quality Checks

1. Missing Data Check

No missing data issues found.

2. Outlier Check

column | outlier_count | outlier_pct | handling | rationale
Age at enrollment | 156 | 3.526221 | flag_only | High age represents non-traditional students (Risk Signal)
Curricular units 1st sem (evaluations) | 33 | 0.745931 | flag_only | Default: Flag for monitoring without modification
Curricular units 2nd sem (evaluations) | 15 | 0.339060 | flag_only | Default: Flag for monitoring without modification

3. Sanity Check (Integrity Audit)

Info (Review)
check_name severity affected_rows details
tuition_vs_scholarship_review INFO 46 Scholarship=1 but tuition=0. Review: partial scholarship or payment timing?
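
A minimal sketch of how such an integrity check could be expressed. It assumes rows are dictionaries keyed by the dataset's column names. The function and the sample rows are illustrative and not the app's actual audit code.

```python
def tuition_vs_scholarship_review(rows):
    """Flag rows where the student holds a scholarship but tuition is not up to date."""
    affected = [
        i for i, row in enumerate(rows)
        if row["Scholarship holder"] == 1 and row["Tuition fees up to date"] == 0
    ]
    return {
        "check_name": "tuition_vs_scholarship_review",
        "severity": "INFO",
        "affected_rows": len(affected),
        "row_indices": affected,  # kept so reviewers can inspect individual cases
    }


# Hypothetical rows: only the second one combines a scholarship with unpaid tuition.
sample_rows = [
    {"Scholarship holder": 1, "Tuition fees up to date": 1},
    {"Scholarship holder": 1, "Tuition fees up to date": 0},
    {"Scholarship holder": 0, "Tuition fees up to date": 0},
]
report = tuition_vs_scholarship_review(sample_rows)
print(report["check_name"], report["severity"], report["affected_rows"])
```

The check is reported at INFO severity because the combination can be legitimate, for example a partial scholarship or a payment timing gap.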

Basic checks (Week 1)

rows cols missing_cells missing_cell_pct duplicate_rows has_missing has_duplicates
1475 37 0 0.0 0 No No

Data Quality Checks

1. Missing Data Check

No missing data issues found (or see Overall tab).

2. Outlier Check

No outlier issues found (or see Overall tab).

3. Sanity Check (Integrity Audit)

Warnings (WARN)
check_name severity affected_rows details
gdp_negative WARN 596 GDP < 0 (data quality issue)
inflation_negative WARN 307 Inflation rate < 0 (deflation or data issue)
Info (Review)
check_name severity affected_rows details
tuition_vs_scholarship_review INFO 19 Scholarship=1 but tuition=0. Review: partial scholarship or payment timing?

Basic checks (Week 2)

rows cols missing_cells missing_cell_pct duplicate_rows has_missing has_duplicates
1475 37 0 0.0 0 No No

Drift check

This vs last

Target this_pct last_pct delta_pct
Graduate 49.762712 48.474576 1.288136
Dropout 31.661017 33.152542 -1.491525
Enrolled 18.576271 18.372881 0.203390

This vs cumulative

Target this_pct week1_pct delta_pct
Graduate 49.762712 48.474576 1.288136
Dropout 31.661017 33.152542 -1.491525
Enrolled 18.576271 18.372881 0.203390
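
The drift columns are differences between outcome shares in two batches. The sketch below reproduces them from class counts. The counts are back-derived here from the reported percentages and the 1,475-row batch size, so treat them as an assumption rather than figures stated in the report.

```python
def outcome_pct(counts):
    """Convert class counts to percentages of the batch."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


def target_drift(this_counts, last_counts):
    """Return delta_pct per class: this batch's share minus the reference share."""
    this_pct = outcome_pct(this_counts)
    last_pct = outcome_pct(last_counts)
    return {label: this_pct[label] - last_pct.get(label, 0.0) for label in this_pct}


# Assumed counts, derived from the percentages in the table (1,475 rows per batch).
reference_batch = {"Graduate": 715, "Dropout": 489, "Enrolled": 271}
current_batch = {"Graduate": 734, "Dropout": 467, "Enrolled": 274}

deltas = target_drift(current_batch, reference_batch)
for label, delta in deltas.items():
    print(f"{label}: {delta:+.6f} percentage points")
```

For the cumulative comparison, the same function can be called with the summed counts of all earlier batches as the reference.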

Data Quality Checks

1. Missing Data Check

No missing data issues found (or see Overall tab).

2. Outlier Check

No outlier issues found (or see Overall tab).

3. Sanity Check (Integrity Audit)

Warnings (WARN)
check_name severity affected_rows details
gdp_negative WARN 572 GDP < 0 (data quality issue)
inflation_negative WARN 316 Inflation rate < 0 (deflation or data issue)
Info (Review)
check_name severity affected_rows details
tuition_vs_scholarship_review INFO 14 Scholarship=1 but tuition=0. Review: partial scholarship or payment timing?

Basic checks (Week 3)

rows cols missing_cells missing_cell_pct duplicate_rows has_missing has_duplicates
1474 37 0 0.0 0 No No

Drift check

This vs last

Target this_pct last_pct delta_pct
Graduate 51.560380 49.762712 1.797668
Dropout 31.546811 31.661017 -0.114206
Enrolled 16.892809 18.576271 -1.683463

This vs cumulative

Target this_pct cum(w1+w2)_pct delta_pct
Graduate 51.560380 49.118644 2.441736
Dropout 31.546811 32.406780 -0.859968
Enrolled 16.892809 18.474576 -1.581768

Data Quality Checks

1. Missing Data Check

No missing data issues found (or see Overall tab).

2. Outlier Check

No outlier issues found (or see Overall tab).

3. Sanity Check (Integrity Audit)

Warnings (WARN)
check_name severity affected_rows details
gdp_negative WARN 543 GDP < 0 (data quality issue)
inflation_negative WARN 300 Inflation rate < 0 (deflation or data issue)
Info (Review)
check_name severity affected_rows details
tuition_vs_scholarship_review INFO 13 Scholarship=1 but tuition=0. Review: partial scholarship or payment timing?

Feature Profiling

Variable distributions by feature group (binary, continuous, categorical)

Data Dictionary: For coded variables, see the UCI Dataset Documentation.

Demographics: binary, continuous/count, and categorical variables
Family background: categorical variables
Financial / administrative: binary variables
Admissions & pathway: continuous/count and categorical variables
Academic signals (Sem 1): continuous/count variables
Academic signals (Sem 2): continuous/count variables
Macro context: continuous/count variables

Target Analysis

How different variables relate to outcomes (Graduate / Enrolled / Dropout)

Variables compared against outcome: Gender, Scholarship holder, Tuition fees up to date, Debtor, International, Age at enrollment, Admission grade, Curricular units 1st sem (approved), Curricular units 1st sem (grade)

Feature Engineering

Temporal trajectories and text-derived features

Longitudinal Features

Time-series features capturing student trajectory over semesters

Feature Name | Description | Type | Mean
Delta_Grade | Change in average grade from Semester 1 to Semester 2 | Continuous | -0.411
Grade_Improvement | Binary indicator of grade improvement | Binary | 0.351
Grade_Decline | Binary indicator of grade decline | Binary | 0.389
CR_Sem1 | Credit Ratio Semester 1 (approved/enrolled) | Continuous | 0.727
CR_Sem2 | Credit Ratio Semester 2 (approved/enrolled) | Continuous | 0.688
Delta_CR | Change in Credit Ratio from Semester 1 to Semester 2 | Continuous | -0.039
Completion_Collapse | Binary indicator of completion rate collapse | Binary | 0.022
Fail_Crossing | Binary indicator of crossing the failure threshold | Binary | 0.044
Borderline_Collapse | Binary indicator of borderline performance collapse | Binary | 0.017
EvalPressure_Sem1 | Evaluation pressure Semester 1 (evaluations/enrolled) | Continuous | 1.345
EvalPressure_Sem2 | Evaluation pressure Semester 2 (evaluations/enrolled) | Continuous | 1.315
Delta_EvalPressure | Change in evaluation pressure between semesters | Continuous | -0.030
GhostRate_Sem1 | Ghost enrollment rate Semester 1 | Continuous | 0.023
GhostRate_Sem2 | Ghost enrollment rate Semester 2 | Continuous | 0.026
Delta_GhostRate | Change in ghost enrollment rate | Continuous | 0.003
Ghost_Worsening | N/A | Binary | 0.011
Key Insight: Longitudinal features (especially Delta_CR and CR_Sem2) often capture whether performance is improving or declining over time. Delta-based features can add signal beyond point-in-time snapshots by reflecting trajectory and momentum.
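
The credit-ratio features can be derived directly from the approved and enrolled counts. This is a sketch under the definitions in the table (approved/enrolled). Treating a semester with zero enrolled units as a ratio of 0.0 is an assumption made here, since the report does not say how that case is handled.

```python
def credit_ratio(approved, enrolled):
    """Share of enrolled curricular units that were approved (0.0 if none enrolled)."""
    return approved / enrolled if enrolled else 0.0


def credit_ratio_features(row):
    """Compute CR_Sem1, CR_Sem2 and Delta_CR for one student record."""
    cr1 = credit_ratio(row["Curricular units 1st sem (approved)"],
                       row["Curricular units 1st sem (enrolled)"])
    cr2 = credit_ratio(row["Curricular units 2nd sem (approved)"],
                       row["Curricular units 2nd sem (enrolled)"])
    # Negative Delta_CR means the student completed a smaller share in Semester 2.
    return {"CR_Sem1": cr1, "CR_Sem2": cr2, "Delta_CR": cr2 - cr1}


# Hypothetical student: all 6 units approved in Semester 1, only 3 of 6 in Semester 2.
student = {
    "Curricular units 1st sem (approved)": 6,
    "Curricular units 1st sem (enrolled)": 6,
    "Curricular units 2nd sem (approved)": 3,
    "Curricular units 2nd sem (enrolled)": 6,
}
features = credit_ratio_features(student)
print(features)
```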

NLP Features (Simulated)

Text-derived psychological and behavioral signals

Feature Name | Description | Source | Coverage | Mean
Academic_Stress | Stress level indicator derived from sentiment analysis | Sentiment analysis | 70.8% | 0.206
Academic_Stress_Level | Categorical stress level (Low/Medium/High) | Sentiment analysis | 70.8% | N/A
Home_Support_Risk | Family support risk score (0=stable, 1=at-risk) | Topic modeling | 70.8% | 0.215
Subject_Specific | Subject-specific difficulties identified | Keyword extraction | 100.0% | N/A
Subject_Risk_Flag | Binary indicator of subject-specific difficulty | Keyword extraction | 100.0% | 0.328
Subject_Difficulty_Score | Numerical difficulty score for subject | Keyword extraction | 70.8% | 0.340

Academic Stress Level Distribution

Sample Simulated Texts

Student 0: "I'm enjoying my courses and managing time well. English writing assignments are very difficult."
Stress: Low | Home Risk: Medium | Subject: Language

Student 3: "I have good support from family and doing well. Language barrier is making comprehension hard."
Stress: Low | Home Risk: Low | Subject: Language

Student 4: "Some courses are harder than expected, but I'm coping. English writing assignments are very difficult."
Stress: Medium | Home Risk: Low | Subject: Language

Note: These features are simulated for demonstration. A production implementation would use actual student text data.

Correlation Analysis

Examining relationships between different feature types

Type 1: Static Variables

Correlations among demographic and enrollment features

Interpretation: Static features often show modest correlations. Typical patterns include relationships among age, admission grade, and prior qualifications.

Type 2: Static vs Longitudinal

How baseline characteristics relate to performance trajectories

Interpretation: Baseline demographic features often have weak-to-moderate correlations with longitudinal change features, suggesting that semester-to-semester dynamics are not fully explained by starting characteristics alone.

Type 3: Longitudinal vs NLP

Links between performance metrics and psychological indicators

Interpretation: Stress-related features may be negatively related to credit ratios, while home-support risk can align with indicators of academic decline, depending on how the simulated signals were generated.

Feature Importance

Identifying predictive features using multiple methods

L1 Logistic Regression

Sparse feature selection through L1 regularization

Rank Feature Coefficient
1 CR_Sem2 -1.171008
2 Tuition fees up to date -0.726209
3 Curricular units 1st sem (approved) -0.604850
4 Curricular units 2nd sem (credited) 0.317461
5 Curricular units 2nd sem (approved) -0.305634
6 Age at enrollment 0.258476
7 Mother's occupation -0.233480
8 Scholarship holder -0.219680
9 Curricular units 1st sem (credited) 0.207680
10 International -0.168801
Interpretation: L1 regularization performs feature selection by shrinking less informative coefficients toward zero. Features with non-zero coefficients carry the strongest signal.

Random Forest

Feature importance based on impurity reduction

Rank Feature Importance
1 CR_Sem2 0.165800
2 CR_Sem1 0.115122
3 Curricular units 2nd sem (approved) 0.111475
4 Curricular units 2nd sem (grade) 0.098250
5 Tuition fees up to date 0.063876
6 Curricular units 1st sem (approved) 0.057655
7 Curricular units 1st sem (grade) 0.040976
8 Academic_Stress 0.034186
9 Age at enrollment 0.025320
10 Delta_CR 0.020882
Interpretation: Random Forest importance reflects how much each feature reduces prediction uncertainty across splits. This method can capture non-linear relationships.

Combined Ranking

Consensus ranking across L1 and Random Forest methods

Feature | Coefficient | Importance | Combined_Score | L1_Rank | RF_Rank
CR_Sem2 | -1.171008 | 0.165800 | 5.937941 | 1 | 1
Tuition fees up to date | -0.726209 | 0.063876 | 3.662985 | 2 | 5
Curricular units 1st sem (approved) | -0.604850 | 0.057655 | 3.053077 | 3 | 6
Curricular units 2nd sem (credited) | 0.317461 | 0.003135 | 1.588870 | 4 | 36
Curricular units 2nd sem (approved) | -0.305634 | 0.111475 | 1.583906 | 5 | 3
Age at enrollment | 0.258476 | 0.025320 | 1.305040 | 6 | 9
Mother's occupation | -0.233480 | 0.008997 | 1.171901 | 7 | 22
Scholarship holder | -0.219680 | 0.005523 | 1.101163 | 8 | 28
Curricular units 1st sem (credited) | 0.207680 | 0.004117 | 1.040456 | 9 | 32
International | -0.168801 | 0.000000 | 0.844006 | 10 | 51
Summary: Features that rank highly across both methods are typically the most stable candidates for downstream decision support and intervention design.
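
The report does not state how Combined_Score is computed. The displayed values are numerically consistent with 5 × |Coefficient| + 0.5 × Importance, so the sketch below uses that form. Both weights are inferred from the table and are an assumption, not the app's documented formula.

```python
def combined_score(coefficient, importance, coef_weight=5.0, importance_weight=0.5):
    """Blend L1 coefficient magnitude with Random Forest importance.

    The default weights are inferred from the displayed scores (assumption).
    """
    return coef_weight * abs(coefficient) + importance_weight * importance


# (coefficient, importance, displayed Combined_Score) copied from the table.
table_rows = {
    "CR_Sem2": (-1.171008, 0.165800, 5.937941),
    "Tuition fees up to date": (-0.726209, 0.063876, 3.662985),
    "International": (-0.168801, 0.000000, 0.844006),
}

for feature, (coef, importance, displayed) in table_rows.items():
    score = combined_score(coef, importance)
    print(f"{feature}: computed {score:.6f}, displayed {displayed:.6f}")
```

With these weights the coefficient dominates the score, which is why the combined ordering matches the L1 ranking exactly even for features with near-zero Random Forest importance.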

Model Overview

Dataset summary and modeling approach

Dataset Statistics

Total Samples: 4424
Training Set: 3539
Test Set: 885
Features: 61

Target Distribution

Outcome Count Percentage
Graduate 2209 49.9%
Dropout 1421 32.1%
Enrolled 794 17.9%
Modeling Strategy: Multi-class classification with balanced class weights to address class imbalance.

Class Imbalance Analysis

Light Imbalance - Class ratio: 2.78:1 (Largest class: Graduate = 2,209, Smallest class: Enrolled = 794)

Handling Strategy

Strategy: class_weight_only
Rationale: With a class ratio of 2.78:1, a conservative class_weight="balanced" approach is a reasonable baseline. This adjusts the loss contribution by class without modifying the observed data distribution.
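
For reference, scikit-learn's "balanced" heuristic sets each class weight to n_samples / (n_classes × class_count). The sketch below applies that formula to the full-dataset counts from the Target Distribution table. In practice the weights would be derived from the training split, so the exact values here are illustrative.

```python
def balanced_class_weights(class_counts):
    """Weights per class: n_samples / (n_classes * count), as in class_weight='balanced'."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count) for label, count in class_counts.items()}


# Counts from the Target Distribution table (full dataset, used for illustration).
counts = {"Graduate": 2209, "Dropout": 1421, "Enrolled": 794}
weights = balanced_class_weights(counts)
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, weight in weights.items():
    print(f"{label}: weight {weight:.4f}")
print(f"Class ratio: {imbalance_ratio:.2f}:1")
```

The smallest class (Enrolled) receives the largest weight, so its errors contribute more to the loss without any resampling of the data.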

Best Model Selection

Model: XGBoost
ROC-AUC (Macro): 0.886
Purpose: Comparison benchmark
Regularization Strategy: L1/L2 regularization + tree constraints
Selection Criteria: Best model selected based on macro-averaged ROC-AUC and F1-score, balancing performance across the three outcome classes.

Model Performance Comparison

Model | Purpose | Precision | Recall | F1-Score | ROC-AUC
Logistic Regression (L1) | Interpretability baseline - identify core risk factors | 0.629 | 0.627 | 0.622 | 0.809
Decision Tree | Rule extraction - interpretable decision logic for stakeholders | 0.668 | 0.634 | 0.635 | 0.840
Random Forest | Performance benchmark - candidate primary model | 0.703 | 0.705 | 0.694 | 0.875
XGBoost | Comparison benchmark | 0.705 | 0.682 | 0.689 | 0.886
Model Selection: Best model selected based on macro-averaged F1-score and ROC-AUC. XGBoost has the highest ROC-AUC (0.886), while Random Forest has the highest F1-score (0.694).

Detailed Model Information

Logistic Regression (L1)

Purpose: Interpretability baseline - identify core risk factors
Regularization: L1 penalty (C=0.1) + balanced class weights

Confusion Matrix

True class | Pred: Dropout | Pred: Enrolled | Pred: Graduate
True: Dropout | 175 | 73 | 36
True: Enrolled | 32 | 79 | 48
True: Graduate | 32 | 71 | 339
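
The macro-averaged precision, recall, and F1 reported for this model can be recomputed from the confusion matrix alone. The sketch below is label-agnostic (it uses only the counts, with rows as true classes and columns as predicted classes) and is an illustration, not the app's evaluation code.

```python
def macro_metrics(matrix):
    """Return macro precision, recall and F1 for a square confusion matrix.

    matrix[i][j] = number of samples of true class i predicted as class j.
    """
    n = len(matrix)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        true_positive = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column sum
        actual_k = sum(matrix[k])                          # row sum
        p = true_positive / predicted_k if predicted_k else 0.0
        r = true_positive / actual_k if actual_k else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if (p + r) else 0.0)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Counts from the Logistic Regression (L1) confusion matrix.
lr_matrix = [
    [175, 73, 36],
    [32, 79, 48],
    [32, 71, 339],
]
precision, recall, f1 = macro_metrics(lr_matrix)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Macro averaging gives each of the three classes equal weight, so the weakest class pulls the averages down regardless of how many test samples it has.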

Top 10 Feature Importance

Rank Feature Importance
1 Curricular units 2nd sem (grade) 0.0480
2 Curricular units 2nd sem (approved) 0.0368
3 Curricular units 1st sem (grade) 0.0306
4 Curricular units 1st sem (approved) 0.0302
5 Curricular units 2nd sem (evaluations) 0.0210
6 Age at enrollment 0.0204
7 Delta_Grade 0.0174
8 Curricular units 1st sem (evaluations) 0.0153
9 Application mode 0.0067
10 Unemployment rate 0.0066

Decision Tree

Purpose: Rule extraction - interpretable decision logic for stakeholders
Regularization: Max depth=5, min samples leaf=50

Confusion Matrix

True class | Pred: Dropout | Pred: Enrolled | Pred: Graduate
True: Dropout | 172 | 75 | 37
True: Enrolled | 19 | 80 | 60
True: Graduate | 1 | 91 | 350

Top 10 Feature Importance

Rank Feature Importance
1 CR_Sem2 0.7330
2 Tuition fees up to date 0.0896
3 Delta_CR 0.0717
4 CR_Sem1 0.0364
5 Curricular units 2nd sem (evaluations) 0.0201
6 GDP 0.0127
7 Age at enrollment 0.0111
8 EvalPressure_Sem1 0.0096
9 Delta_EvalPressure 0.0048
10 Course 0.0042

Random Forest

Purpose: Performance benchmark - candidate primary model
Regularization: Balanced class weights + depth and leaf constraints

Confusion Matrix

True class | Pred: Dropout | Pred: Enrolled | Pred: Graduate
True: Dropout | 201 | 64 | 19
True: Enrolled | 26 | 98 | 35
True: Graduate | 13 | 80 | 349

Top 10 Feature Importance

Rank Feature Importance
1 CR_Sem2 0.1565
2 Curricular units 2nd sem (approved) 0.1390
3 CR_Sem1 0.1200
4 Curricular units 2nd sem (grade) 0.0775
5 Curricular units 1st sem (approved) 0.0750
6 Tuition fees up to date 0.0464
7 EvalPressure_Sem2 0.0393
8 Curricular units 1st sem (grade) 0.0372
9 Academic_Stress 0.0352
10 EvalPressure_Sem1 0.0341

Fairness Audit Overview

Features Audited: 7
DI Pass Rate: 100%
EO Pass Rate: 100%
Overall Status: Pass
Fairness Criteria: Disparate Impact ≥ 0.8 and Equalized Odds differences < 0.1.
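
A minimal sketch of how these two criteria could be evaluated for one binary feature. It assumes binary labels (1 = predicted or actual dropout) and a 0/1 group indicator, and it computes Disparate Impact as the lower selection rate divided by the higher one, which is a symmetric variant chosen here for illustration. The sample data is hypothetical.

```python
def _rate(numerator, denominator):
    return numerator / denominator if denominator else 0.0


def fairness_metrics(y_true, y_pred, group):
    """Compute Disparate Impact and Equalized Odds gaps between group 0 and group 1."""
    stats = {}
    for g in (0, 1):
        idx = [i for i, value in enumerate(group) if value == g]
        positives = [i for i in idx if y_true[i] == 1]
        negatives = [i for i in idx if y_true[i] == 0]
        stats[g] = {
            "selection": _rate(sum(y_pred[i] for i in idx), len(idx)),
            "tpr": _rate(sum(y_pred[i] for i in positives), len(positives)),
            "fpr": _rate(sum(y_pred[i] for i in negatives), len(negatives)),
        }
    high = max(stats[0]["selection"], stats[1]["selection"])
    low = min(stats[0]["selection"], stats[1]["selection"])
    di = low / high if high else 1.0
    tpr_diff = abs(stats[0]["tpr"] - stats[1]["tpr"])
    fpr_diff = abs(stats[0]["fpr"] - stats[1]["fpr"])
    return {
        "disparate_impact": di,
        "tpr_diff": tpr_diff,
        "fpr_diff": fpr_diff,
        # Criteria from the audit: DI >= 0.8 and both Equalized Odds gaps < 0.1.
        "passes": di >= 0.8 and tpr_diff < 0.1 and fpr_diff < 0.1,
    }


# Hypothetical predictions for eight students, four in each group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]
audit = fairness_metrics(y_true, y_pred, group)
print(audit)
```

In this sample, group 1 is flagged three times as often as group 0, so the audit fails on all three measures.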

Feature | Disparate Impact | TPR Difference | FPR Difference | Status
Gender | 1.000 | 0.000 | 0.000 | Fair
Scholarship holder | 1.000 | 0.000 | 0.000 | Fair
Displaced | 1.000 | 0.000 | 0.000 | Fair
International | 1.000 | 0.000 | 0.000 | Fair
Nacionality | 1.000 | 0.000 | 0.000 | Fair
Debtor | 1.000 | 0.000 | 0.000 | Fair
Educational special needs | 1.000 | 0.000 | 0.000 | Fair